Intelligent Paper Notes

Avast-CTU Public CAPE Dataset

Branislav Bosansky, Dominik Kouba, Ondrej Manhal, Thorsten Sick, Viliam Lisy, Jakub Kroustek, Petr Somol

Categories: Artificial Intelligence | Machine Learning

2022-09-06

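Only limited public data are available to support research on malware analysis techniques. In particular, there are virtually no publicly available datasets generated by rich sandboxes such as Cuckoo/CAPE. The benefit of a dynamic sandbox is that it realistically simulates the execution of a file on the target machine and records a log of that execution. The machine can be infected by malware, so there is a good chance of capturing malicious behavior in the execution logs, which lets researchers study that behavior in detail. Although the subsequent analysis of log information is extensively covered in industrial cybersecurity backends, to our knowledge only limited effort has been invested in academia to advance such log analysis with state-of-the-art techniques. We make this sample dataset available to support the design of new machine learning methods for malware detection, especially for the automatic detection of generic malicious behavior. The dataset was collected in cooperation between Avast Software and the Czech Technical University - AI Center (AIC).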


Explaining Classifiers Trained on Raw Hierarchical Multiple-Instance Data

Tomáš Pevný, Viliam Lisý, Branislav Bošanský, Petr Somol, Michal Pěchouček

Categories: Machine Learning (Statistics) | Machine Learning

2022-08-04

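Learning from raw data input, which limits the need for feature engineering, is a component of many successful applications of machine learning methods in various domains. While many problems naturally translate into a vector representation that standard classifiers can use directly, a number of data sources have the natural form of structured data interchange formats (e.g., security logs in JSON/XML format). Existing methods, such as Hierarchical Multiple-Instance Learning (HMIL), allow learning from such data in their raw form. However, the explanation of classifiers trained on raw structured data remains largely unexplored. By treating the explanation of these models as a subset selection problem, we demonstrate how interpretable explanations with favourable properties can be generated using computationally efficient algorithms. We compare against an explanation technique adopted from graph neural networks and show an order-of-magnitude speed-up and higher-quality explanations.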
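
The idea of explanation as subset selection can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes a flat record instead of a nested HMIL sample, uses greedy backward elimination as one simple heuristic, and the classifier `toy_score`, the log fields, and the threshold are all hypothetical.

```python
from typing import Callable

Sample = dict  # a flat JSON-like record; nesting is omitted to keep the sketch short


def greedy_explanation(sample: Sample,
                       score: Callable[[Sample], float],
                       threshold: float) -> Sample:
    """Return a subset of `sample` whose score stays at or above `threshold`.

    Greedy backward elimination: try dropping each key in turn and keep the
    drop whenever the classifier is still confident without that key.
    """
    if score(sample) < threshold:
        raise ValueError("the full sample is not classified positively")
    kept = dict(sample)
    for key in list(sample):
        candidate = {k: v for k, v in kept.items() if k != key}
        if score(candidate) >= threshold:
            kept = candidate  # the key was not needed for the decision
    return kept


def toy_score(sample: Sample) -> float:
    """Hypothetical stand-in for a trained classifier over a security log."""
    s = 0.0
    if sample.get("registry_write") == "Run":
        s += 0.6
    if "powershell" in str(sample.get("process", "")):
        s += 0.3
    return s


# Hypothetical log record; only two of the four fields drive the score.
log = {"process": "powershell.exe", "registry_write": "Run",
       "user": "alice", "pid": 4242}
explanation = greedy_explanation(log, toy_score, threshold=0.8)
print(explanation)
```

The returned subset keeps only the fields the toy classifier needs to stay above the threshold, which is the sense in which a subset of the raw input serves as the explanation.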


Neural Transition-based Parsing of Library Deprecations

Petr Babkin, Nacho Navarro, Salwa Alamir, Sameena Shah

Categories: Natural Language Processing

2022-12-23

This paper tackles the challenging problem of automating code updates to fix deprecated API usages of open source libraries by analyzing their release notes. Our system employs a three-tier architecture: first, a web crawler service retrieves deprecation documentation from the web; then a specially built parser processes those text documents into tree-structured representations; finally, a client IDE plugin locates and fixes identified deprecated usages of libraries in a given codebase. The focus of this paper in particular is the parsing component. We introduce a novel transition-based parser in two variants: one based on a classical feature-engineered classifier and one based on a neural tree encoder. To confirm the effectiveness of our method, we gathered and labeled a set of 426 API deprecations from 7 well-known Python data science libraries, and demonstrated that our approach decisively outperforms a non-trivial neural machine translation baseline.
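
A transition-based parser builds a tree by repeatedly choosing an action given a stack and a buffer of remaining tokens. The Python sketch below shows that general loop under stated assumptions: the action set (SHIFT, REDUCE), the node labels, the hand-written policy `toy_policy`, and the names `lib.old_func` and `lib.new_func` are all hypothetical, and the policy merely stands in for the learned classifier described in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

    def to_sexpr(self) -> str:
        """Render the tree as an S-expression for easy inspection."""
        if not self.children:
            return self.label
        parts = [self.label] + [c.to_sexpr() for c in self.children]
        return "(" + " ".join(parts) + ")"


def parse(tokens, choose_action):
    """Shift-reduce loop; `choose_action` plays the role of the classifier."""
    if not tokens:
        raise ValueError("nothing to parse")
    stack, buffer = [], list(tokens)
    while buffer or len(stack) > 1:
        action = choose_action(stack, buffer)
        if action[0] == "SHIFT":
            stack.append(Node(buffer.pop(0)))
        elif action[0] == "REDUCE":
            _, label, arity = action
            children = stack[-arity:]
            del stack[-arity:]
            stack.append(Node(label, children))
        else:
            raise ValueError(f"unknown action {action!r}")
    return stack[0]


def toy_policy(stack, buffer):
    """Hand-written rules standing in for a trained action classifier."""
    if len(stack) >= 2 and stack[-2].label == "use" and not stack[-1].children:
        return ("REDUCE", "REPLACEMENT", 2)  # group "use <new name>"
    if buffer:
        return ("SHIFT",)
    return ("REDUCE", "DEPRECATION", len(stack))  # wrap everything at the end


tokens = ["lib.old_func", "deprecated", "use", "lib.new_func"]
tree = parse(tokens, toy_policy)
print(tree.to_sexpr())
```

Swapping `toy_policy` for a feature-based or neural scorer changes how actions are chosen but leaves the parsing loop itself untouched, which is the usual appeal of the transition-based design.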


Unsigned Play by Milan Kundera? An Authorship Attribution Study

Lenka Jungmannová, Petr Plecháč

Categories: Natural Language Processing

2022-12-19

In addition to being a widely recognised novelist, Milan Kundera has also authored three pieces for theatre: The Owners of the Keys (Majitelé klíčů, 1961), The Blunder (Ptákovina, 1967), and Jacques and his Master (Jakub a jeho pán, 1971). In recent years, however, the hypothesis has been raised that Kundera is the true author of a fourth play: Juro Jánošík, first performed in a 1974 production under the name of Karel Steigerwald, who was Kundera's student at the time. In this study, we make use of supervised machine learning to settle the question of authorship attribution in the case of Juro Jánošík, with results strongly supporting the hypothesis of Kundera's authorship.


Flowstorm: Open-Source Platform with Hybrid Dialogue Architecture

Jan Pichl, Petr Marek, Jakub Konrád, Petr Lorenc, Ondřej Kobza, Tomáš Zajíček, Jan Šedivý

Categories: Artificial Intelligence

2022-12-19

This paper presents a conversational AI platform called Flowstorm. Flowstorm is an open-source SaaS project suitable for creating, running, and analyzing conversational applications. Thanks to the fast and fully automated build process, the dialogues created within the platform can be executed in seconds. Furthermore, we propose a novel dialogue architecture that combines tree structures with generative models. The tree structures are also used for training NLU models suitable for specific dialogue scenarios. The generative models, in contrast, are used globally across applications and extend the functionality of the dialogue trees. Moreover, the platform benefits from out-of-the-box components, such as those responsible for extracting data from utterances or working with crawled data. Additionally, it can be extended with custom code directly in the platform. One of the essential features of the platform is the possibility to reuse the created assets across applications. There is a library of prepared assets to which each developer can contribute. All of the features are available through a user-friendly visual editor.


Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks

Esaú Villatoro-Tello, Srikanth Madikeri, Juan Zuluaga-Gomez, Bidisha Sharma, Seyyed Saeed Sarfjoo, Iuliia Nigmatulina, Petr Motlicek, Alexei V. Ivanov, Aravind Ganapathiraju

Categories: Natural Language Processing | Artificial Intelligence

2022-12-16

In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems on the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of the performance achievable by different state-of-the-art SLU systems under different circumstances, e.g., automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs allows SLU systems to improve in comparison to the 1-best setup (4% relative improvement). However, crossmodal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, and a relative improvement of 18% over the 1-best configuration. Thus, crossmodal architectures represent a good alternative to overcome the limitations of working purely with automatically generated textual data.


Speech and Natural Language Processing Technologies for Pseudo-Pilot Simulator

Amrutha Prasad, Juan Zuluaga-Gomez, Petr Motlicek, Saeed Sarfjoo, Iuliia Nigmatulina, Karel Vesely

Categories: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-12-14

This paper describes a simple yet efficient repetition-based modular system for speeding up the training of air-traffic controllers (ATCos). For example, a human pilot is still required in EUROCONTROL's ESCAPE lite simulator (see https://www.eurocontrol.int/simulator/escape) during ATCo training. However, this need can be met by an automatic system that acts as a pilot. In this paper, we aim to develop and integrate a pseudo-pilot agent into the ATCo training pipeline by merging diverse artificial intelligence (AI) powered modules. The system understands the voice communications issued by the ATCo and, in turn, generates a spoken prompt that follows the pilot's phraseology in response to the initial communication. Our system mainly relies on open-source AI tools and air traffic control (ATC) databases, which demonstrates its simplicity and ease of replication. The overall pipeline is composed of the following: (1) a submodule that receives and pre-processes the input stream of raw audio; (2) an automatic speech recognition (ASR) system that transforms audio into a sequence of words; (3) a high-level ATC-related entity parser, which extracts relevant information from the communication, i.e., callsigns and commands; and finally, (4) a speech synthesizer submodule that generates responses based on the high-level ATC entities previously extracted. Overall, we show that this system could pave the way toward developing a real proof-of-concept pseudo-pilot system, speeding up the training of ATCos while drastically reducing its overall cost.
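
The four-stage pipeline described above can be sketched as a chain of functions. Everything below is a toy stand-in rather than the authors' system: `recognize` returns a fixed, made-up transcript instead of running an ASR model, the regular expression is a hypothetical callsign pattern, and `synthesize` returns the readback text that a speech synthesizer would then voice.

```python
import re
from dataclasses import dataclass


@dataclass
class Entities:
    callsign: str
    command: str


def preprocess(audio: list[float]) -> list[float]:
    """(1) Toy pre-processing: peak-normalise the raw samples."""
    peak = max((abs(x) for x in audio), default=0.0)
    return [x / peak for x in audio] if peak else list(audio)


def recognize(audio: list[float]) -> str:
    """(2) Placeholder for an ASR model; always returns one fixed transcript."""
    return "lufthansa one two three descend flight level eight zero"


DIGITS = "zero|one|two|three|four|five|six|seven|eight|nine"
# Hypothetical pattern: an airline word followed by spoken digits, then the command.
UTTERANCE_RE = re.compile(
    rf"^(?P<callsign>\w+(?: (?:{DIGITS}))+) (?P<command>.+)$"
)


def extract_entities(transcript: str) -> Entities:
    """(3) Toy entity parser: split the transcript into callsign and command."""
    match = UTTERANCE_RE.match(transcript)
    if match is None:
        raise ValueError(f"no callsign found in {transcript!r}")
    return Entities(match["callsign"], match["command"])


def synthesize(entities: Entities) -> str:
    """(4) Build the pilot readback text; a TTS engine would voice this string."""
    return f"{entities.command}, {entities.callsign}"


def pseudo_pilot(audio: list[float]) -> str:
    """Run the four stages in order on one controller utterance."""
    transcript = recognize(preprocess(audio))
    return synthesize(extract_entities(transcript))


reply = pseudo_pilot([0.1, -0.5, 0.25])
print(reply)
```

Because each stage only depends on the output type of the previous one, any stub can be replaced by a real component without touching the others, which mirrors the modular design the abstract emphasizes.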


Predicting article quality scores with machine learning: The UK Research Excellence Framework

Mike Thelwall, Kayvan Kousha, Mahshid Abdoli, Emma Stuart, Meiko Makita, Paul Wilson, Jonathan Levitt, Petr Knoth, Matteo Cancellieri

Categories: Artificial Intelligence

2022-12-11

National research evaluation initiatives and incentive schemes have previously chosen between simplistic quantitative indicators and time-consuming peer review, sometimes supported by bibliometrics. Here we assess whether artificial intelligence (AI) could provide a third alternative, estimating article quality using multiple bibliometric and metadata inputs. We investigated this using provisional three-level REF2021 peer review scores for 84,966 articles submitted to the UK Research Excellence Framework 2021, matching a Scopus record 2014-18 and with a substantial abstract. We found that accuracy is highest in the medical and physical sciences Units of Assessment (UoAs) and economics, reaching 42% above the baseline (72% overall) in the best case. This is based on 1000 bibliometric inputs and half of the articles used for training in each UoA. Prediction accuracies above the baseline for the social science, mathematics, engineering, arts, and humanities UoAs were much lower or close to zero. The Random Forest Classifier (standard or ordinal) and Extreme Gradient Boosting Classifier algorithms performed best of the 32 tested. Accuracy was lower if UoAs were merged or replaced by Scopus broad categories. We increased accuracy with an active learning strategy and by selecting articles with higher prediction probabilities, as estimated by the algorithms, but this substantially reduced the number of scores predicted.


SoftCTC – Semi-Supervised Learning for Text Recognition using Soft Pseudo-Labels

Martin Kišš, Michal Hradiš, Karel Beneš, Petr Buchal, Michal Kula

Categories: Machine Learning | Computer Vision

2022-12-05

This paper explores semi-supervised training for sequence tasks, such as Optical Character Recognition or Automatic Speech Recognition. We propose a novel loss function, SoftCTC, which is an extension of CTC that allows multiple transcription variants to be considered at the same time. This makes it possible to omit the confidence-based filtering step, which is otherwise a crucial component of pseudo-labeling approaches to semi-supervised learning. We demonstrate the effectiveness of our method on a challenging handwriting recognition task and conclude that SoftCTC matches the performance of a finely tuned filtering-based pipeline. We also evaluated SoftCTC in terms of computational efficiency, concluding that it is significantly more efficient than a naïve CTC-based approach for training on multiple transcription variants, and we make our GPU implementation public.
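
For background, the standard CTC objective that SoftCTC extends scores a single transcription by summing the probabilities of all frame-level alignments that collapse to it. The Python sketch below implements that plain CTC likelihood with the usual forward recursion; it is not SoftCTC and does not handle multiple transcription variants. The function name, the two-symbol alphabet, and the uniform probabilities are illustrative assumptions.

```python
def ctc_likelihood(probs, target, blank=0):
    """Probability that per-frame outputs `probs` collapse to `target`.

    probs[t][k] is the probability of symbol k at frame t;
    target is a list of non-blank symbol ids.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    n_frames, n_states = len(probs), len(ext)

    # alpha[s] = total probability of all partial alignments ending in state s
    alpha = [0.0] * n_states
    alpha[0] = probs[0][blank]
    if n_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, n_frames):
        new = [0.0] * n_states
        for s in range(n_states):
            total = alpha[s]                 # stay in the same state
            if s >= 1:
                total += alpha[s - 1]        # advance by one state
            # skipping a blank is allowed only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end in the last label or the trailing blank.
    return alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)


# Two frames, alphabet {0: blank, 1: "a"}, uniform outputs.
uniform = [[0.5, 0.5], [0.5, 0.5]]
p_single = ctc_likelihood(uniform, [1])     # alignments "a-", "-a", "aa"
p_double = ctc_likelihood(uniform, [1, 1])  # "aa" needs a blank in between
print(p_single, p_double)
```

With two frames, the target "a" is reached by three alignments of probability 0.25 each, while "aa" is unreachable because a repeated label requires at least three frames. In training, the loss is the negative logarithm of this likelihood.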


WeatherFusionNet: Predicting Precipitation from Satellite Data

Jiří Pihrt, Rudolf Raevskiy, Petr Šimánek, Matej Choma

Categories: Computer Vision | Machine Learning

2022-11-30

The short-term prediction of precipitation is critical in many areas of life. Recently, a large body of work has been devoted to forecasting radar reflectivity images. However, radar images are available only in areas with ground weather radars. Thus, we aim to predict high-resolution precipitation from lower-resolution satellite radiance images. A neural network called WeatherFusionNet is employed to predict severe rain up to eight hours in advance. WeatherFusionNet is a U-Net architecture that fuses three different ways of processing the satellite data: predicting future satellite frames, extracting rain information from the current frames, and using the input sequence directly. Using the presented method, we achieved 1st place in the NeurIPS 2022 Weather4Cast Core challenge. The code and trained parameters are available at https://github.com/Datalab-FIT-CTU/weather4cast-2022.
